
ViSL One-shot: generating a Vietnamese Sign Language data set

Abstract

Developing methods for automatic recognition of objects in a video stream, and of sign language in particular, requires large amounts of video data for training. An established approach to data augmentation in machine learning is to apply distortion and add noise. Linguistic gestures differ from other gestures in that small changes in posture can radically change the meaning of a gesture, which imposes specific requirements on data variability. The novelty of the proposed method is that, instead of distorting frames with affine image transformations, it vectorizes the signer's pose and then adds noise in the form of random deviations of skeletal elements. To implement controlled gesture variability, the MediaPipe library is used to convert the pose to a vector format in which each vector corresponds to a skeletal element. The image of the figure is then restored from the vector representation. The advantage of this method is that gestures can be distorted in a controlled way that corresponds to real deviations in the signer's postures. The developed video data augmentation method was tested on a set of 60 words of Indian Sign Language (shared across the languages and dialects used in India), represented by 782 video fragments. For each word, the most representative gesture was selected and 100 variations were generated. The remaining, less representative gestures were used as test data. The resulting word-level classification and recognition model, based on a GRU-LSTM neural network, achieves an accuracy above 95%. The method validated in this way was then applied to a corpus of 4364 videos in Vietnamese Sign Language covering all three regions of Vietnam: Northern, Central and Southern.
This produced 436,400 data samples, 100 for each source video. By providing many variations of each gesture with varying degrees of deviation from the standard form, the data set can be used to develop and improve Vietnamese Sign Language recognition methods. A limitation of the proposed method is that its accuracy depends on the error of the MediaPipe library. The created video data set can also be used for automatic sign language translation.
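
The skeletal-noise idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The six-joint skeleton, the `PARENT` table, the function names, and the choice of Gaussian rotation noise with a `sigma_deg` parameter are all assumptions made for illustration. In practice the joint coordinates would come from MediaPipe landmarks, and the perturbed skeleton would still have to be rendered back into a frame; neither step is shown here.

```python
import numpy as np

# Hypothetical toy skeleton (an assumption, not the MediaPipe topology):
# PARENT[i] is the joint that joint i is attached to, -1 marks the root.
# Parents must appear before their children.
# 0 hip centre, 1 neck, 2 left shoulder, 3 left elbow, 4 right shoulder, 5 right elbow
PARENT = [-1, 0, 1, 2, 1, 4]


def to_bone_vectors(joints, parent=PARENT):
    """Convert joint coordinates (N, 2) into one vector per skeletal element."""
    joints = np.asarray(joints, dtype=float)
    bones = np.zeros_like(joints)
    for i, p in enumerate(parent):
        # The root keeps its absolute position; every other entry is an offset.
        bones[i] = joints[i] if p < 0 else joints[i] - joints[p]
    return bones


def from_bone_vectors(bones, parent=PARENT):
    """Restore joint coordinates by chaining bone vectors from the root."""
    joints = np.zeros_like(bones)
    for i, p in enumerate(parent):
        joints[i] = bones[i] if p < 0 else joints[p] + bones[i]
    return joints


def perturb_pose(joints, sigma_deg=3.0, rng=None, parent=PARENT):
    """Rotate each bone by a small random angle (assumed Gaussian noise).

    Bone lengths are preserved, so only the posture changes, not the
    proportions of the figure.
    """
    rng = np.random.default_rng() if rng is None else rng
    bones = to_bone_vectors(joints, parent)
    angles = np.deg2rad(rng.normal(0.0, sigma_deg, size=len(parent)))
    cos, sin = np.cos(angles), np.sin(angles)
    noisy = np.empty_like(bones)
    noisy[:, 0] = cos * bones[:, 0] - sin * bones[:, 1]
    noisy[:, 1] = sin * bones[:, 0] + cos * bones[:, 1]
    is_root = np.asarray(parent) < 0
    noisy[is_root] = bones[is_root]  # leave the root position untouched
    return from_bone_vectors(noisy, parent)


# Usage: 100 variations of one made-up reference pose (normalised coordinates).
pose = np.array([[0.50, 0.50], [0.50, 0.30], [0.40, 0.30],
                 [0.35, 0.45], [0.60, 0.30], [0.65, 0.45]])
rng = np.random.default_rng(0)
variations = np.stack([perturb_pose(pose, rng=rng) for _ in range(100)])
```

Working on bone vectors rather than pixels makes the amount of deviation explicit: `sigma_deg` bounds how far each skeletal element may swing, which is the kind of control that whole-frame affine transformations do not give.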

Keywords
